MSDS 6306 - Case Study 02

Huy Hoang Nguyen

12/05/2019

III. EDA1 - Analysis of each variable and some related variables

First, I will analyze the dataset by analyzing each variable by visualization. We have 870 observations in total (employees).

I also convert the other variables to factor. I will work on the dataset data1 (<-data0).

data1 <- data0

factorcolumns1 <- c("Education", "EnvironmentSatisfaction", "JobLevel", "NumCompaniesWorked", "PercentSalaryHike",  "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany", "YearsInCurrentRole",  "YearsSinceLastPromotion", "YearsWithCurrManager")
data1[,factorcolumns1] <- lapply(data1[,factorcolumns1], as.factor)
skim(data1)
Data summary
Name data1
Number of rows 870
Number of columns 31
_______________________
Column type frequency:
factor 25
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
Education 0 1 FALSE 5 3: 324, 4: 240, 2: 182, 1: 98
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
EnvironmentSatisfaction 0 1 FALSE 4 4: 262, 3: 258, 2: 178, 1: 172
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobInvolvement 0 1 FALSE 4 3: 514, 2: 228, 4: 81, 1: 47
JobLevel 0 1 FALSE 5 1: 329, 2: 312, 3: 132, 4: 60
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
JobSatisfaction 0 1 FALSE 4 4: 271, 3: 254, 1: 179, 2: 166
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
NumCompaniesWorked 0 1 FALSE 10 1: 320, 0: 111, 3: 91, 4: 85
OverTime 0 1 FALSE 2 No: 618, Yes: 252
PercentSalaryHike 0 1 FALSE 15 11: 126, 13: 123, 14: 120, 12: 119
PerformanceRating 0 1 FALSE 2 3: 738, 4: 132
RelationshipSatisfaction 0 1 FALSE 4 4: 264, 3: 261, 1: 174, 2: 171
StockOptionLevel 0 1 FALSE 4 0: 379, 1: 355, 2: 81, 3: 55
TotalWorkingYears 0 1 FALSE 39 10: 132, 6: 72, 8: 61, 9: 58
TrainingTimesLastYear 0 1 FALSE 7 2: 309, 3: 308, 5: 75, 4: 73
WorkLifeBalance 0 1 FALSE 4 3: 532, 2: 192, 4: 98, 1: 48
YearsAtCompany 0 1 FALSE 32 1: 107, 5: 105, 3: 85, 2: 77
YearsInCurrentRole 0 1 FALSE 19 2: 223, 0: 151, 7: 136, 3: 68
YearsSinceLastPromotion 0 1 FALSE 16 0: 342, 1: 214, 2: 94, 7: 41
YearsWithCurrManager 0 1 FALSE 17 2: 202, 0: 166, 7: 131, 3: 76

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.83 8.93 18 30.0 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.5 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.0 7.0 14.00 29 ▇▅▂▂▂
HourlyRate 0 1 65.61 20.13 30 48.0 66.0 83.00 100 ▇▇▆▇▇
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.5 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.0 14074.5 20456.25 26997 ▇▇▇▇▇

1. Monthly Income:

First, I will take a look at monthly income of employees by the following histogram.

x <- data1$MonthlyIncome 
h<-hist(x, breaks=10, col="steelblue", xlab="Monthly Income", 
        main="Histogram with Normal Curve for Monthly Income") 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="red", lwd=2)

The following code will show the minimum and maximum salary of 870 employees in the dataset.

range(data1$MonthlyIncome)
## [1]  1081 19999

Then the Monthly Income is from 1081 USD to 19999 USD. By the histogram, it is right skewed. The question is that “Will we transform this variable before studying?”

By the histogram, we see most people have salary in the range [2000,4000] and second range is [4000,6000]. I will divide into 6 groups as follows: 1081 - <2000, 2000 - <4000, 4000 - <6000, 6000 - <10000, and from 10000 - <16000 and 16000 - <20000.

data1$IncomeGroup <- cut(data1$MonthlyIncome, c(0,2000,4000,6000,10000,16000,20000), labels = c("<$2000","$2000-$4000","$4000 - $6000","$6000-$10000","$10000-$16000","$16000-$200000"), include.lowest = TRUE)

By the following boxplot,

ggplot(data1, aes_string(x = "IncomeGroup", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Income Groups")+ xlab("Income Groups") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

and the following barplot,

ggplot(data1, aes_string("IncomeGroup")) + geom_bar(fill="steelblue") + xlab ("Monthly Income Group") + ylab("Number of Employees") +ggtitle("Monthly Income vs. Employee Groups")

we can see that:

  • The salary less than 2000 USD is considered as in lower class.

  • The salary from 2000 USD to 4000 USD is considered as in lower middle class and from 4000 USD to 6000 USD is considered as in middle class and from 6000 USD to 10000 USD is considered as in upper middle class.

  • The salary from 10000 USD to 16000 USD is considered as in the lower high class and from 16000 USD to 20000 USD is considered as in the high class.

Now I will study the relationship between Income Group and Attrition.

ggplot(data1, aes_string(x = "IncomeGroup", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Income Group") + ylab("Percent numbers of employees")

  • Then employees with lower Monthly Income will have more chance to leave the current jobs.

2. Attrition:

I will observe the atrrition data first to see the percentage of employees who left jobs.

stats <- function(df, x) {
  df %>% group_by_at(x) %>% 
    summarise(Count = n(), Proportion = scales::percent(n()/dim(df)[1])) %>% 
    kable() %>% kable_styling(full_width = FALSE)
}

stats(data1, "Attrition")
Attrition Count Proportion
No 730 83.9%
Yes 140 16.1%
ggplot(data1, aes_string("Attrition")) + geom_bar(fill="steelblue") + xlab ("Attrition") + ylab("Number of Employees") 

  • Then there are 140/870 = 16.1 % left jobs (Attrition).

3. Age:

First, I will observe Ages of Employees in the dataset.

summary(data1$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   30.00   35.00   36.83   43.00   60.00
  • Then, Ages of Employees in this dataset are from 18 to 60 years old.

We can also use the following code to see the range of ages.

range(data1$Age)
## [1] 18 60

Now I will take a look at Age variable by the following histogram.

x <- data1$Age 
h<-hist(x, breaks=10, col="steelblue", xlab="Age", 
        main="Histogram with Normal Curve for Employee Ages") 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="red", lwd=2)

  • Most employees have Ages between 25 to 35.

We will see the relationship between Age and Attrition by the following barplot.

ggplot(data1, aes_string(x = "Age", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Age") + ylab("Percent numbers of employees")

  • Employees with Ages between 18 - 21 don’t stay in the same job for long time.
  • People with Age 58-60 don’t leave job.
  • People with Age range 30-50 stay with job. They want to build their careers with the same company.

By the following Scatterplot, we can see the relationship between Age and Monthly Income.

ggplot(data1, aes_string(x = "Age", y = "MonthlyIncome")) + geom_point() + geom_smooth(method="lm")

By the histogram, I will divide Age into 4 groups: 18-25, 25-35, 35-45, 45-60.

data1$AgeGroup <- cut(data1$Age, c(18,25,35,45,60), labels = c("18-25","25-35","35-45","45-60"), include.lowest = TRUE)

Now I will study the relationship between Age groups and Attrition.

ggplot(data1, aes_string(x = "AgeGroup", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Age Groups") + ylab("Percentage of employees")

  • Employees with Age range 18-25 will leave their current jobs more than other groups.
  • Employees with Age range 35-45 will stay with their jobs to build their careers.

We can also see the relationship between Age Groups and Monthly Income here.

ggplot(data1, aes_string(x = "AgeGroup", y = "MonthlyIncome")) + geom_point() + geom_smooth(method="lm")

ggplot(data1, aes_string(x = "AgeGroup", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Age Groups")+ xlab("Age Groups") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Logically, employees with higher Ages have bigger Income.

4. Business Travel:

summary(data1$BusinessTravel)
##        Non-Travel Travel_Frequently     Travel_Rarely 
##                94               158               618
stats(data1,"BusinessTravel")
BusinessTravel Count Proportion
Non-Travel 94 10.8%
Travel_Frequently 158 18.2%
Travel_Rarely 618 71.0%
  • Most employees travel rarely (618/870 = 71%).

Now I will study the relationship between Bussiness Travel and Age.

ggplot(data1, aes_string(x = "AgeGroup", fill = "BusinessTravel")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Age Groups") + ylab("Business Travel")

  • Then the biggest percentage of frequent travel are for Age Group 25-35.
  • The biggest percenatge of non-travel are for Age Group 35-45. They have stable job and family.
  • The smallest percentage of non-travel are for Age Group 45-60.

Now we will see the relationship between Business Travel and Attrition.

ggplot(data1, aes_string(x = "BusinessTravel", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Business Travel") + ylab("Attrition")

  • The most frequent travelers have the highest attrition rates.
ggplot(data1, aes_string(x = "BusinessTravel", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Business Travel Groups")+ xlab("Business Travel") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Non-Travel employee group has the lowest income.

5. Department:

summary(data1$Department)
##        Human Resources Research & Development                  Sales 
##                     35                    562                    273
stats(data1, "Department")
Department Count Proportion
Human Resources 35 4.02%
Research & Development 562 64.6%
Sales 273 31.4%
  • 64.6 % employees work in Research and Developpment Department.
  • Only 4.02% employees work in Human Resources.
  • The job market needs more people for R&D or Sales Department.
ggplot(data1, aes_string(x = "Department", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Department") + ylab("Attrition")

  • The Sales Department has the highest rates in Attrition.
ggplot(data1, aes_string(x = "Department", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Department")+ xlab("Department") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • The Mean Incomes are similar between Department groups.
  • The Median Income is strongly different. HR Department has the lowest Median Income and Sales Department has the highest Median Income.

6. Distance from Home:

The following histogram will show us the distance from home of Employees.

x <- data1$DistanceFromHome
h<-hist(x, breaks=10, col="steelblue", xlab="Distance from Home", 
        main="Distance from Home vs. Number of Employees") 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="red", lwd=2)

  • Most employees work near home (less than 10 miles).
ggplot(data1, aes_string(x = "DistanceFromHome", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Distance from Home") + ylab("Attrition")

  • The highest rates in Attrition for the Distance from Home between 21-23 miles. Actually, I don’t see strong relationship here.

7. Education:

stats(data1, "Education")
Education Count Proportion
1 98 11.3%
2 182 20.9%
3 324 37.2%
4 240 27.6%
5 26 2.99%
ggplot(data1, aes_string(x = "Education", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Education") + ylab("Attrition")

ggplot(data1, aes_string(x = "Education", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Education")+ xlab("Education") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees has the level 3 in Education.
  • Higher level in Education has lower rates in Attrition.
  • The Highest level in Education (level 5) has the Highest Monthly Income.

8. Education Field:

stats(data1, "EducationField")
EducationField Count Proportion
Human Resources 15 1.72%
Life Sciences 358 41.1%
Marketing 100 11.5%
Medical 270 31.0%
Other 52 5.98%
Technical Degree 75 8.62%
ggplot(data1, aes_string(x = "EducationField", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Education Field") + ylab("Attrition")

ggplot(data1, aes_string(x = "EducationField", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Education Field")+ xlab("Education Field") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees has formation in Life Science.
  • The lowest median income is in Human Resouce but the highest pay is in HR.
  • The highest median income is in Marketing field.

9. Environment Satisfaction:

stats(data1, "EnvironmentSatisfaction")
EnvironmentSatisfaction Count Proportion
1 172 19.8%
2 178 20.5%
3 258 29.7%
4 262 30.1%
ggplot(data1, aes_string(x = "EnvironmentSatisfaction", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Environment Satisfaction") + ylab("Attrition")

ggplot(data1, aes_string(x = "EnvironmentSatisfaction", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Environment Satisfaction")+ xlab("Environment Satisfaction") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees are sastified with their jobs.
  • Employees who are less sastified with their jobs have the highest rates in Attrition.

10. Gender:

stats(data1, "Gender")
Gender Count Proportion
Female 354 40.7%
Male 516 59.3%
ggplot(data1, aes_string(x = "Gender", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Gender") + ylab("Attrition")

ggplot(data1, aes_string(x = "Gender", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Gender")+ xlab("Gender") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • The rates in Attrition variable are similar between Male and Female in Gender variable.
  • Male group gain less than Female group in term of Median Income.

11. Job Involvement:

stats(data1, "JobInvolvement")
JobInvolvement Count Proportion
1 47 5.40%
2 228 26.2%
3 514 59.1%
4 81 9.31%
ggplot(data1, aes_string(x = "JobInvolvement", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Job Involvement") + ylab("Attrition")

ggplot(data1, aes_string(x = "JobInvolvement", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Involvement")+ xlab("Job Involvement") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Employees with lower job involvement have higher rates in Attrition.

12. Job Level:

stats(data1, "JobLevel")
JobLevel Count Proportion
1 329 37.8%
2 312 35.9%
3 132 15.2%
4 60 6.90%
5 37 4.25%
ggplot(data1, aes_string(x = "JobLevel", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Job Level") + ylab("Attrition")

ggplot(data1, aes_string(x = "JobLevel", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Level")+ xlab("Job Level") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Number of employees is lower when their job levels are higher.
  • Employees with the lowest job level (1) have the highest rates in Attrition.
  • The relationship between Job Level and Monthly Income is linear positively.

13. Job Role:

stats(data1, "JobRole")
JobRole Count Proportion
Healthcare Representative 76 8.74%
Human Resources 27 3.10%
Laboratory Technician 153 17.6%
Manager 51 5.86%
Manufacturing Director 87 10.0%
Research Director 51 5.86%
Research Scientist 172 19.8%
Sales Executive 200 23.0%
Sales Representative 53 6.09%
ggplot(data1, aes_string(x = "JobRole", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Job Role") + ylab("Attrition")

ggplot(data1, aes_string(x = "JobRole", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Role")+ xlab("Job Role") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Sales Representatives have the highest rates in Attrition and low pay.
  • Manufacturing Directors and Reseach Directors have the lowest rates in Attrition.
  • Managers and Research Directors have the highest pay and low rates in Attrition.

14. Job Satisfaction:

stats(data1, "JobSatisfaction")
JobSatisfaction Count Proportion
1 179 20.6%
2 166 19.1%
3 254 29.2%
4 271 31.1%
ggplot(data1, aes_string(x = "JobSatisfaction", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Job Satisfaction") + ylab("Attrition")

ggplot(data1, aes_string(x = "JobSatisfaction", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Job Satisfaction")+ xlab("Job Satisfaction") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees are satisfied with their jobs.
  • Those employees with lower job satisfaction level have higher rates in Attrition.
  • Mean and Median Incomes are similar.

15. Marital Status:

stats(data1, "MaritalStatus")
MaritalStatus Count Proportion
Divorced 191 22.0%
Married 410 47.1%
Single 269 30.9%
ggplot(data1, aes_string(x = "MaritalStatus", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Marital Status") + ylab("Attrition")

ggplot(data1, aes_string(x = "MaritalStatus", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Marital Status")+ xlab("Marital Status") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees are married.
  • Single employees have the highest attrition and below average pay.

16. Number of Companies Worked:

stats(data1, "NumCompaniesWorked")
NumCompaniesWorked Count Proportion
0 111 12.8%
1 320 36.8%
2 74 8.51%
3 91 10.5%
4 85 9.77%
5 43 4.94%
6 39 4.48%
7 46 5.29%
8 28 3.22%
9 33 3.79%
ggplot(data1, aes_string(x = "NumCompaniesWorked", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Number of Companies Worked") + ylab("Attrition")

ggplot(data1, aes_string(x = "NumCompaniesWorked", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Number of Companies Worked")+ xlab("Number of Companies Worked") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • In the dataset, we can see the number of companies that employees worked is 0. Then it is difficult to understand the concept. I think that the number of companies that employees had worked before starting the job in this company in order to make sense.
  • Who have worked at some companies have higher rates in Attrition.

17. Over Time:

stats(data1, "OverTime")
OverTime Count Proportion
No 618 71.0%
Yes 252 29.0%
ggplot(data1, aes_string(x = "OverTime", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Overtime") + ylab("Attrition")

ggplot(data1, aes_string(x = "OverTime", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Overtime")+ xlab("Overtime") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees don’t work overtime.
  • Who have to work overtime have higher rates in Attrition and lower income.

18. Percent Salary Hike:

stats(data1, "PercentSalaryHike")
PercentSalaryHike Count Proportion
11 126 14.5%
12 119 13.7%
13 123 14.1%
14 120 13.8%
15 54 6.21%
16 43 4.94%
17 56 6.44%
18 57 6.55%
19 40 4.60%
20 27 3.10%
21 33 3.79%
22 30 3.45%
23 17 1.95%
24 14 1.61%
25 11 1.26%
ggplot(data1, aes_string(x = "PercentSalaryHike", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Percent Salary Hike") + ylab("Attrition")

ggplot(data1, aes_string(x = "PercentSalaryHike", y = "MonthlyIncome", fill = x)) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Percent Salary Hike")+ xlab("Percent Salary Hike") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Who have Percent Salary Hike between 22-24% have lower Mean Monthly Income and higher rates in Attrition.

19. Performance Rating:

stats(data1, "PerformanceRating")
PerformanceRating Count Proportion
3 738 84.8%
4 132 15.2%
ggplot(data1, aes_string(x = "PerformanceRating", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Performance Rating") + ylab("Attrition")

ggplot(data1, aes_string(x = "PerformanceRating", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Performance Rating")+ xlab("Performance Rating") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • There are only 2 ratings and the results are similar.
  • I will want to remove this variable because it is a self rating and the results will not affect to our analysis.

20. Relationship Satisfaction:

stats(data1, "RelationshipSatisfaction")
RelationshipSatisfaction Count Proportion
1 174 20.0%
2 171 19.7%
3 261 30.0%
4 264 30.3%
ggplot(data1, aes_string(x = "RelationshipSatisfaction", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Relationship Satisfaction") + ylab("Attrition")

ggplot(data1, aes_string(x = "RelationshipSatisfaction", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Relationship Satisfaction")+ xlab("Relationship Satisfaction") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Relationship Satisfaction divides into 4 groups similarly.
  • Employees with low Relationship Satisfaction have high rates in Attrition.
  • Mean and Median Income are similar between groups.

21. Stock Option Level:

stats(data1, "StockOptionLevel")
StockOptionLevel Count Proportion
0 379 43.6%
1 355 40.8%
2 81 9.31%
3 55 6.32%
ggplot(data1, aes_string(x = "StockOptionLevel", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Stock Option Level") + ylab("Attrition")

ggplot(data1, aes_string(x = "StockOptionLevel", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Stock Option Level")+ xlab("Stock Option Level") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees fall in Stock Option level 0 or 1.
  • Stock option levels 0 and 3 have the highest rates in Attrition and lowest median incomes.

22. Total Working Years:

stats(data1, "TotalWorkingYears")
TotalWorkingYears Count Proportion
0 7 0.805%
1 48 5.52%
2 20 2.30%
3 27 3.10%
4 35 4.02%
5 49 5.63%
6 72 8.28%
7 47 5.40%
8 61 7.01%
9 58 6.67%
10 132 15.2%
11 20 2.30%
12 23 2.64%
13 25 2.87%
14 22 2.53%
15 29 3.33%
16 24 2.76%
17 17 1.95%
18 13 1.49%
19 13 1.49%
20 17 1.95%
21 17 1.95%
22 14 1.61%
23 14 1.61%
24 9 1.03%
25 6 0.690%
26 8 0.920%
27 3 0.345%
28 6 0.690%
29 4 0.460%
30 5 0.575%
31 6 0.690%
32 6 0.690%
33 2 0.230%
34 3 0.345%
35 2 0.230%
36 4 0.460%
37 1 0.115%
40 1 0.115%
ggplot(data1, aes_string(x = "TotalWorkingYears", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16) + xlab("Total Working Years") + ylab("Attrition")

ggplot(data1, aes_string(x = "TotalWorkingYears", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Total Working Years")+ xlab("Total Working Years") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Income increases linearly with total working years.
  • Employees have lower total working years have higher rates in Attrition.
  • Especialy, who have 40 years of working have 100% rates in Attrition (retirement?)
  • Employees have 34-39 years of working don’t leave jobs, waiting for retirement and full benefits?

23. Training Times Last Year:

stats(data1, "TrainingTimesLastYear")
TrainingTimesLastYear Count Proportion
0 30 3.45%
1 39 4.48%
2 309 35.5%
3 308 35.4%
4 73 8.39%
5 75 8.62%
6 36 4.14%
ggplot(data1, aes_string(x = "TrainingTimesLastYear", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Training Times Last Year") + ylab("Attrition")

ggplot(data1, aes_string(x = "TrainingTimesLastYear", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Training Times Last Year")+ xlab("Training Times Last Years") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most Employees had 2 or 3 training times last year.
  • Employees had 0 or 4 training times last year have higher rates in Attrition.
  • Employees had 5 or 6 training times last year have the lowest rates in Attrition.
  • Suprisingly, who had no training time last year have the highest Median and Mean Income.

24. Work Life Balance:

stats(data1, "WorkLifeBalance")
WorkLifeBalance Count Proportion
1 48 5.52%
2 192 22.1%
3 532 61.1%
4 98 11.3%
ggplot(data1, aes_string(x = "WorkLifeBalance", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Work Life Balance") + ylab("Attrition")

ggplot(data1, aes_string(x = "WorkLifeBalance", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Work Life Balance")+ xlab("Work Life Balance") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Those have bad work life balance then have higher rates in Attrition and lower income.

25. Years At Company:

stats(data1, "YearsAtCompany")
YearsAtCompany Count Proportion
0 28 3.22%
1 107 12.3%
2 77 8.85%
3 85 9.77%
4 54 6.21%
5 105 12.1%
6 44 5.06%
7 42 4.83%
8 50 5.75%
9 52 5.98%
10 77 8.85%
11 17 1.95%
12 9 1.03%
13 20 2.30%
14 14 1.61%
15 11 1.26%
16 6 0.690%
17 5 0.575%
18 8 0.920%
19 8 0.920%
20 14 1.61%
21 6 0.690%
22 12 1.38%
23 1 0.115%
24 4 0.460%
25 2 0.230%
26 3 0.345%
30 1 0.115%
31 2 0.230%
32 3 0.345%
33 2 0.230%
40 1 0.115%
ggplot(data1, aes_string(x = "YearsAtCompany", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("Years At Company") + ylab("Attrition")

ggplot(data1, aes_string(x = "YearsAtCompany", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. Years At Company")+ xlab("Years At Company") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Income increases linearly with Years at Company.
  • Employees have lower Years at Company have higher rates in Attrition.
  • Especialy, who have 40 Years at Company have 100% rates in Attrition (retirement?)

26. Years In Current Role:

stats(data1, "YearsInCurrentRole")
YearsInCurrentRole Count Proportion
0 151 17.4%
1 38 4.37%
2 223 25.6%
3 68 7.82%
4 53 6.09%
5 26 2.99%
6 17 1.95%
7 136 15.6%
8 56 6.44%
9 40 4.60%
10 14 1.61%
11 15 1.72%
12 7 0.805%
13 9 1.03%
14 7 0.805%
15 3 0.345%
16 3 0.345%
17 3 0.345%
18 1 0.115%
ggplot(data1, aes_string(x = "YearsInCurrentRole", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("YearsInCurrentRole") + ylab("Attrition")

ggplot(data1, aes_string(x = "YearsInCurrentRole", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. YearsInCurrentRole")+ xlab("YearsInCurrentRole") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • It’s a linear relationship between Monthly Income and Years in Current Role.
  • Who stay more than 15 years in current role don’t leave their jobs.

27. Years Since Last Promotion:

stats(data1, "YearsSinceLastPromotion")
YearsSinceLastPromotion Count Proportion
0 342 39.3%
1 214 24.6%
2 94 10.8%
3 32 3.68%
4 32 3.68%
5 30 3.45%
6 23 2.64%
7 41 4.71%
8 12 1.38%
9 9 1.03%
10 4 0.460%
11 14 1.61%
12 5 0.575%
13 5 0.575%
14 5 0.575%
15 8 0.920%
ggplot(data1, aes_string(x = "YearsSinceLastPromotion", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("YearsSinceLastPromotion") + ylab("Attrition")

ggplot(data1, aes_string(x = "YearsSinceLastPromotion", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Monthly Income vs. YearsSinceLastPromotion")+ xlab("YearsSinceLastPromotion") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • The highest Mean Income is for 12 Years and zero rate in Attrition.

28. Years With Current Manager:

stats(data1, "YearsWithCurrManager")
YearsWithCurrManager Count Proportion
0 166 19.1%
1 40 4.60%
2 202 23.2%
3 76 8.74%
4 51 5.86%
5 22 2.53%
6 12 1.38%
7 131 15.1%
8 68 7.82%
9 44 5.06%
10 18 2.07%
11 11 1.26%
12 13 1.49%
13 7 0.805%
14 4 0.460%
15 1 0.115%
17 4 0.460%
ggplot(data1, aes_string(x = "YearsWithCurrManager", fill = "Attrition")) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) + 
    geom_abline(slope = 0, intercept = .16) + xlab("YearsWithCurrManager") + ylab("Attrition")

ggplot(data1, aes_string(x = "YearsWithCurrManager", y = "MonthlyIncome")) + 
  geom_boxplot(colour = "black", fill = "steelblue")+ ggtitle("Distribution between Monthly Income vs. YearsWithCurrManager")+ xlab("YearsWithCurrManager") + ylab("Monthly Income") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) 

  • Most employees have 2 years working with the current managers. After that, maybe they move to another companies and promote to next levels?
  • People with 12, 13, 15,16 years with the same managers have zero rate in Attrition.
  • Employees have 14 years with the same managers have the highest Mean and Median Income.

29. Hourly Rate vs Daily Rate vs Monthly Rate vs Monthly Income:

There are 4 similar variables HourlyRate, DailyRate, MonthlyRate, MonthlyIncome. I will see the relationship of these 3 first variables with MonthlyIncome and these 4 variables with Attrition.

ggpairs(data = data1, 
              mapping = aes(color = Attrition),
              columns = c("HourlyRate","DailyRate","MonthlyRate","MonthlyIncome"))

  • Weak relationship between HourlyRate/DailyRate/MonthlyRate.
  • No meaningful relationship between MonthlyIncome with HourlyRate/DailyRate/MonthlyRate.
  • No meaningful relationship between Attrition with HourlyRate/DailyRate/MonthlyRate.

Then I will not consider these variables in the future analysis.

30. Satisfaction level:

There are 3 variables related to Satisfaction level: EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction.

First, I will change these variables from factor to numeric.

numcolumns <- c("EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction")
data1[,numcolumns] <- lapply(data1[,numcolumns], as.numeric)

data1$Satisfaction <-   as.factor(round((data1$EnvironmentSatisfaction +data1$JobSatisfaction+ 
                                     data1$RelationshipSatisfaction)/3))


stats(data1, "Satisfaction")
Satisfaction Count Proportion
1 24 2.76%
2 296 34.0%
3 455 52.3%
4 95 10.9%
ggplot(data1, aes_string(x = "Satisfaction", fill = "Attrition")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16)

ggplot(data1, aes_string(x = "Satisfaction", fill = "IncomeGroup")) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  geom_abline(slope = 0, intercept = .16)

  • Most employees have level 2 or 3 in Satisfaction.
  • Employees with the level 4 in Satisfaction have the lowest rates in Attrition.
  • More employees with the highest Salary and less employees with the lowest Salary have the level 4 in Satisfaction.

32. Conclusion:

  • As the above analyses, I will keep the variable MonthlyIncome and drop 3 variables HourlyRate, DailyRate and MonthlyRate.
  • I will also drop the PerformanceRating variable.

After the first analyses on each variable,I will create a new dataset as follows.

data2 <- select(data0, -c("HourlyRate","DailyRate", "MonthlyRate","PerformanceRating"))
skim(data2)
Data summary
Name data2
Number of rows 870
Number of columns 27
_______________________
Column type frequency:
factor 12
numeric 15
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobInvolvement 0 1 FALSE 4 3: 514, 2: 228, 4: 81, 1: 47
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
JobSatisfaction 0 1 FALSE 4 4: 271, 3: 254, 1: 179, 2: 166
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
OverTime 0 1 FALSE 2 No: 618, Yes: 252
RelationshipSatisfaction 0 1 FALSE 4 4: 264, 3: 261, 1: 174, 2: 171
WorkLifeBalance 0 1 FALSE 4 3: 532, 2: 192, 4: 98, 1: 48

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.83 8.93 18 30.0 35.0 43 60 ▂▇▇▃▂
DistanceFromHome 0 1 9.34 8.14 1 2.0 7.0 14 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.0 3.0 4 5 ▂▅▇▆▁
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.0 3.0 4 4 ▅▆▁▇▇
JobLevel 0 1 2.04 1.09 1 1.0 2.0 3 5 ▇▇▃▂▁
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.5 4945.5 8182 19999 ▇▅▂▁▁
NumCompaniesWorked 0 1 2.73 2.52 0 1.0 2.0 4 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.0 14.0 18 25 ▇▅▃▂▁
StockOptionLevel 0 1 0.78 0.86 0 0.0 1.0 1 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.0 10.0 15 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.0 3.0 3 6 ▂▇▇▂▃
YearsAtCompany 0 1 6.96 6.02 0 3.0 5.0 10 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.0 3.0 7 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.0 1.0 3 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.0 3.0 7 17 ▇▂▅▁▁